This R Markdown document explores house sale prices in King County, USA, between May 2014 and May 2015.

After installing (if needed) and loading the required packages and reading the dataset, I will carry out an exploratory data analysis (EDA) with visualizations to identify the most important features and the correlations between them.

Secondly, feature engineering will be conducted: the process of using domain knowledge to create new features from the raw data which, if done correctly, increase the predictive power of the machine learning algorithms.

Finally, I will apply different machine learning algorithms and evaluate their performance on a held-out validation set produced by a train-test split.

  1. Calling all the required packages and installing them if needed

The train-test splitting function that will be used later on (it assigns roughly two-thirds of the rows to the train set):

# Splits a data frame into random train and test sets (roughly 2:1)
splitdf <- function(dataframe, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  # sample ~2/3 of the row indices for the train set
  trainindex <- sample(index, trunc(length(index)/1.5))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset=trainset, testset=testset)
}
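As a quick sanity check, the split proportions can be verified on the built-in `mtcars` data set (32 rows, used here purely for illustration; the function is repeated so the snippet runs standalone):

```r
# splitdf() as defined above, repeated so this snippet is self-contained
splitdf <- function(dataframe, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index)/1.5))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset=trainset, testset=testset)
}

# trunc(32/1.5) = 21 rows go to the train set, the remaining 11 to the test set
parts <- splitdf(mtcars, seed = 20)
nrow(parts$trainset)   # 21
nrow(parts$testset)    # 11
# the two pieces partition the rows with no overlap
length(intersect(rownames(parts$trainset), rownames(parts$testset)))   # 0
```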
  1. Data loading and preprocessing:
  1. Loading the train and test datasets from my gist account.
train <- read.csv(url("https://gist.githubusercontent.com/CelineKhoury/53ddc466a65e9ae1cbaa2773638b87b9/raw/3baf60131c21bb35782e043a5bc77a6f386184a4/house_price_train.csv"))
test <- read.csv(url("https://gist.githubusercontent.com/CelineKhoury/2ea2e63bf418d91270812f2a1f30ef2b/raw/8f6a575a60da419e0a8dc029ad17d1db235939a8/house_price_test.csv"))

summary(train)
##        id                    date           price            bedrooms     
##  Min.   :1.000e+06   23/06/2014:  118   Min.   :  78000   Min.   : 1.000  
##  1st Qu.:2.114e+09   25/06/2014:  113   1st Qu.: 320000   1st Qu.: 3.000  
##  Median :3.902e+09   26/06/2014:  110   Median : 450000   Median : 3.000  
##  Mean   :4.566e+09   22/04/2015:  108   Mean   : 539865   Mean   : 3.369  
##  3rd Qu.:7.303e+09   08/07/2014:  104   3rd Qu.: 645500   3rd Qu.: 4.000  
##  Max.   :9.900e+09   14/04/2015:  102   Max.   :7700000   Max.   :33.000  
##                      (Other)   :16622                                     
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.500   Min.   :  370   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1430   1st Qu.:   5050   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7620   Median :1.500  
##  Mean   :2.114   Mean   : 2080   Mean   :  15186   Mean   :1.493  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10695   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1164794   Max.   :3.500  
##                                                                   
##    waterfront            view          condition         grade      
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 3.00  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.00  
##  Median :0.000000   Median :0.0000   Median :3.000   Median : 7.00  
##  Mean   :0.007467   Mean   :0.2335   Mean   :3.413   Mean   : 7.66  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.00  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :13.00  
##                                                                     
##    sqft_above   sqft_basement       yr_built     yr_renovated    
##  Min.   : 370   Min.   :   0.0   Min.   :1900   Min.   :   0.00  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.00  
##  Median :1564   Median :   0.0   Median :1975   Median :   0.00  
##  Mean   :1791   Mean   : 289.4   Mean   :1971   Mean   :  85.35  
##  3rd Qu.:2210   3rd Qu.: 556.0   3rd Qu.:1997   3rd Qu.:   0.00  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.00  
##                                                                  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 460  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1986  
##  3rd Qu.:98117   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##                                                                 
##    sqft_lot15    
##  Min.   :   659  
##  1st Qu.:  5100  
##  Median :  7639  
##  Mean   : 12826  
##  3rd Qu.: 10080  
##  Max.   :871200  
## 
summary(test)
##        id                    date         bedrooms      bathrooms    
##  Min.   :5.200e+06   09/07/2014:  35   Min.   :1.00   Min.   :0.500  
##  1st Qu.:2.193e+09   23/04/2015:  30   1st Qu.:3.00   1st Qu.:1.500  
##  Median :3.972e+09   23/07/2014:  30   Median :3.00   Median :2.250  
##  Mean   :4.637e+09   28/04/2015:  30   Mean   :3.39   Mean   :2.123  
##  3rd Qu.:7.367e+09   20/05/2014:  29   3rd Qu.:4.00   3rd Qu.:2.500  
##  Max.   :9.842e+09   16/07/2014:  28   Max.   :9.00   Max.   :6.750  
##                      (Other)   :4138                                 
##   sqft_living      sqft_lot           floors      waterfront     
##  Min.   : 420   Min.   :    649   Min.   :1.0   Min.   :0.00000  
##  1st Qu.:1430   1st Qu.:   5026   1st Qu.:1.0   1st Qu.:0.00000  
##  Median :1920   Median :   7586   Median :1.5   Median :0.00000  
##  Mean   :2081   Mean   :  14755   Mean   :1.5   Mean   :0.00787  
##  3rd Qu.:2530   3rd Qu.:  10654   3rd Qu.:2.0   3rd Qu.:0.00000  
##  Max.   :9200   Max.   :1651359   Max.   :3.5   Max.   :1.00000  
##                                                                  
##       view          condition         grade          sqft_above  
##  Min.   :0.0000   Min.   :1.000   Min.   : 4.000   Min.   : 420  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000   1st Qu.:1200  
##  Median :0.0000   Median :3.000   Median : 7.000   Median :1550  
##  Mean   :0.2375   Mean   :3.399   Mean   : 7.651   Mean   :1780  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:2210  
##  Max.   :4.0000   Max.   :5.000   Max.   :13.000   Max.   :6720  
##                                                                  
##  sqft_basement       yr_built     yr_renovated        zipcode     
##  Min.   :   0.0   Min.   :1900   Min.   :   0.00   Min.   :98001  
##  1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.00   1st Qu.:98033  
##  Median :   0.0   Median :1975   Median :   0.00   Median :98070  
##  Mean   : 301.1   Mean   :1971   Mean   :  80.92   Mean   :98079  
##  3rd Qu.: 600.0   3rd Qu.:1997   3rd Qu.:   0.00   3rd Qu.:98118  
##  Max.   :3500.0   Max.   :2015   Max.   :2015.00   Max.   :98199  
##                                                                   
##       lat             long        sqft_living15    sqft_lot15    
##  Min.   :47.16   Min.   :-122.5   Min.   : 399   Min.   :   651  
##  1st Qu.:47.48   1st Qu.:-122.3   1st Qu.:1490   1st Qu.:  5100  
##  Median :47.57   Median :-122.2   Median :1840   Median :  7556  
##  Mean   :47.56   Mean   :-122.2   Mean   :1989   Mean   : 12489  
##  3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360   3rd Qu.: 10116  
##  Max.   :47.78   Max.   :-121.4   Max.   :5790   Max.   :858132  
## 
  1. Create the column “price” (filled with NA) in the test set so that both datasets can be combined
dim(train)
## [1] 17277    21
dim(test)
## [1] 4320   20
test$price <- NA
  1. Create the column “dataType” in both the train and test sets, with the values ‘train’ and ‘test’ respectively. This will let us split the combined dataset back apart by dataType rather than by row position.
train$dataType <- "train"
test$dataType <- "test"
  1. Merging the train and test datasets
dim(train)
## [1] 17277    22
dim(test)
## [1] 4320   22
identical(names(train), names(test))
## [1] FALSE
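The FALSE here is only a column-order difference: `price` was appended at the end of the test set, while it sits earlier in the train set. Base R's `rbind()` for data frames matches columns by name, so the bind below is still correct. A minimal illustration:

```r
# rbind() for data frames aligns the second argument's columns to the first by
# name, so a differing column order does not scramble the values.
a <- data.frame(x = 1:2, y = c("p", "q"))
b <- data.frame(y = c("r", "s"), x = 3:4)   # same columns, different order
identical(names(a), names(b))               # FALSE, as for train and test
ab <- rbind(a, b)
ab$x                                        # 1 2 3 4 -- matched by name
```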
dataset <- rbind(train, test)

summary(dataset)
##        id                    date           price            bedrooms     
##  Min.   :1.000e+06   23/06/2014:  142   Min.   :  78000   Min.   : 1.000  
##  1st Qu.:2.123e+09   25/06/2014:  131   1st Qu.: 320000   1st Qu.: 3.000  
##  Median :3.905e+09   26/06/2014:  131   Median : 450000   Median : 3.000  
##  Mean   :4.580e+09   08/07/2014:  127   Mean   : 539865   Mean   : 3.373  
##  3rd Qu.:7.309e+09   27/04/2015:  126   3rd Qu.: 645500   3rd Qu.: 4.000  
##  Max.   :9.900e+09   25/03/2015:  123   Max.   :7700000   Max.   :33.000  
##                      (Other)   :20817   NA's   :4320                      
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.500   Min.   :  370   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1430   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.116   Mean   : 2080   Mean   :  15099   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10685   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##                                                                   
##    waterfront            view          condition        grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.00   Min.   : 3.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.00   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.00   Median : 7.000  
##  Mean   :0.007547   Mean   :0.2343   Mean   :3.41   Mean   : 7.658  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.00   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.00   Max.   :13.000  
##                                                                     
##    sqft_above   sqft_basement       yr_built     yr_renovated    
##  Min.   : 370   Min.   :   0.0   Min.   :1900   Min.   :   0.00  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.00  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.00  
##  Mean   :1789   Mean   : 291.7   Mean   :1971   Mean   :  84.46  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.00  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.00  
##                                                                  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##                                                                 
##    sqft_lot15       dataType        
##  Min.   :   651   Length:21597      
##  1st Qu.:  5100   Class :character  
##  Median :  7620   Mode  :character  
##  Mean   : 12758                     
##  3rd Qu.: 10083                     
##  Max.   :871200                     
## 
str(dataset)
## 'data.frame':    21597 obs. of  22 variables:
##  $ id           : num  9.18e+09 4.64e+08 2.22e+09 6.16e+09 6.39e+09 ...
##  $ date         : Factor w/ 372 levels "01/02/2015","01/03/2015",..: 155 329 219 355 277 241 201 199 319 255 ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ bedrooms     : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot     : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors       : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : int  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ lat          : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
##  $ dataType     : chr  "train" "train" "train" "train" ...

We have 20 feature columns, and the target variable is “price”. Let’s analyze the features to check whether they need cleaning or preprocessing.

  1. Check if we have null or missing values by column:
colSums(is.na(dataset))
##            id          date         price      bedrooms     bathrooms 
##             0             0          4320             0             0 
##   sqft_living      sqft_lot        floors    waterfront          view 
##             0             0             0             0             0 
##     condition         grade    sqft_above sqft_basement      yr_built 
##             0             0             0             0             0 
##  yr_renovated       zipcode           lat          long sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15      dataType 
##             0             0
  1. Outlier Analysis:

We will visualize all numeric columns against the house price.

numeric_var <- names(train)[which(sapply(train, is.numeric))]

for (i in seq_along(numeric_var)) {
  plt <- ggplot(data = train, aes_string(numeric_var[i], 'price')) +
    geom_point(color = "red") +
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
  print(plt)
}

Removing outliers manually, based on the plots of each variable against the target variable:

train <- subset(train, bedrooms < 20)
train <- subset(train, sqft_living < 11000)
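These thresholds were picked by eye from the plots. A more systematic alternative (a sketch only, not used in the rest of this analysis; `trim_upper` is a hypothetical helper) is to cut on an upper quantile:

```r
# Hypothetical helper: keep only rows at or below the p-th quantile of a column.
trim_upper <- function(df, col, p = 0.999) {
  df[df[[col]] <= quantile(df[[col]], p, na.rm = TRUE), ]
}

# e.g. trim_upper(train, "sqft_living") would drop the most extreme ~0.1% of rows
d <- data.frame(v = c(1:99, 1000))    # one obvious outlier
nrow(trim_upper(d, "v", p = 0.99))    # 99: the outlier is removed
```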
  1. Correlation plot
dfCor <- cor(train[sapply(train, is.numeric)])
corrplot(dfCor)

  1. Data visualization / EDA:
## Check for the Distribution of the price

g1<-ggplot(dataset,aes(x=price))+geom_density(fill="blue")
g2<-ggplot(dataset,aes(x=sqft_living))+geom_histogram(binwidth=1,fill="green")
grid.arrange(g1,g2,nrow=1,ncol=2)
## Warning: Removed 4320 rows containing non-finite values (stat_density).

## The distributions of house prices & sqft_living are right-skewed, so let's apply log10() and then plot the distributions again

dataset<-dataset %>% 
             mutate(log_price=log10(price))
ggplot(dataset,aes(x=log_price))+geom_histogram(fill="green",binwidth=0.10)
## Warning: Removed 4320 rows containing non-finite values (stat_bin).

dataset<-dataset %>% 
             mutate(log_size=log10(sqft_living))
ggplot(dataset,aes(x=log_size))+geom_histogram(fill="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
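The effect of the transform can also be checked numerically: for a right-skewed variable the mean sits well above the median, and taking log10 pulls the two together. A sketch on simulated log-normal data (the parameters are arbitrary):

```r
set.seed(1)
x <- 10^rnorm(10000, mean = 5.5, sd = 0.25)  # simulated right-skewed "prices"
mean(x) > median(x)          # TRUE: the long right tail drags the mean upward
lx <- log10(x)
abs(mean(lx) - median(lx))   # near 0: roughly symmetric after log10
```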

Plot the log price against the bedrooms variable.

ggplot(dataset,aes(x=bedrooms,y=log_price,col=bedrooms))+
geom_point(alpha=0.5,size=2)+
geom_smooth(method="lm",se=F)+
labs(title="Bedrooms vs Price")+scale_color_gradientn(colors='yellow')+theme(legend.position="none")
## Warning: Removed 4320 rows containing non-finite values (stat_smooth).
## Warning: Removed 4320 rows containing missing values (geom_point).

Check for the distribution of the bathrooms variable.

ggplot(dataset,aes(x=bathrooms))+geom_histogram(fill="green4",binwidth=0.5,size=0.1)+
         scale_x_continuous(limits=c(1,8))
## Warning: Removed 75 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Check for the distribution of the house prices in respect to the condition of the house.

ggplot(dataset,aes(factor(condition),log_price,fill=factor(condition)))+
geom_boxplot(alpha=0.6)+scale_fill_manual(values=rainbow(6))+
theme(legend.position="none")+
labs(x="House Condition")
## Warning: Removed 4320 rows containing non-finite values (stat_boxplot).

Interpretation of the boxplot: houses in better condition tend to sell at higher prices.

Creating a new logical variable ‘houseview’: TRUE if the house has a waterfront view, FALSE otherwise.

dataset$houseview<-ifelse(dataset$waterfront==1,TRUE,FALSE)

dataset %>%
  select(log_price, houseview) %>%
  glimpse()
## Observations: 21,597
## Variables: 2
## $ log_price <dbl> 5.352183, 5.807027, 5.908485, 5.518514, 5.724276, 5....
## $ houseview <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
ggplot(dataset, aes(x = houseview, y = log_price,fill=houseview)) +
  geom_boxplot(alpha=0.5) +
  labs(x = "waterfront", y = "logprice")+
scale_fill_manual(values=rainbow(n=12))
## Warning: Removed 4320 rows containing non-finite values (stat_boxplot).

Interpretation of the boxplot: The houses that have a view of the waterfront tend to be much more expensive than the houses that don’t have a view of the waterfront.

  1. Feature engineering:

1- Creating a new variable to check if there is an outdoor area in the lot which is given by the difference between the sqft_lot and the sqft_living.

2- First extract the year from the date sold (yr_sold). Then, create a new variable indicating the age of the house, given by the difference between the year the house was sold and the year it was built.

If the difference is equal to zero it means that the house is new.

3- Creating a new variable ‘renovated’ indicating if the house is renovated or not.

4- Since zipcodes represent the area, the average price per zipcode and the number of houses per zipcode will be added to the dataset (note that these values are computed from the train set only, since in practice the test-set prices would not be available).

The dataset is split here once, just to keep test-set information out of the train-set aggregates. Please keep in mind this split is only used to complete the feature engineering.

5- To make these variables more precise, the average per zipcode should not include the price of the house that we are training on, so the formula ((avg * cnt) - price) / (cnt - 1) will be applied.

6- Since we don’t have the prices of the final test set, the adjusted price average is NA for those rows; it is replaced by the plain per-zipcode average.

7- Calculating the grade and condition relative to the neighbors: (grade - avg_grade(per_zip)) / std_grade(per_zip) and (condition - avg_condition(per_zip)) / std_condition(per_zip). Please keep in mind that grade and condition are available information even for the test set.
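The leave-one-out formula in step 5 can be checked on a toy vector: removing the current house's price from the zipcode total and renormalising gives exactly the mean of the remaining houses.

```r
prices <- c(200, 300, 400, 500)                  # toy prices within one zipcode
avg <- mean(prices)
cnt <- length(prices)
loo <- ((avg * cnt) - prices[1]) / (cnt - 1)     # leave-one-out average for house 1
loo                                              # 400
mean(prices[-1])                                 # 400 -- identical by construction
```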

# 1

dataset$outdoor <- dataset$sqft_lot - dataset$sqft_living

# 2

dataset$yr_sold <- as.character(dataset$date)
dataset$yr_sold <- substring(dataset$yr_sold, 7, 10)

# if the difference is equal to zero it means that the house is new 

dataset$yr_sold <- as.numeric (dataset$yr_sold)
dataset$yr_built <- as.numeric (dataset$yr_built)
dataset$house_age <- dataset$yr_sold - dataset$yr_built

# 3 

dataset <- dataset %>% mutate(renovated = ifelse(yr_renovated == 0, 0, 1))
table(dataset$renovated)
## 
##     0     1 
## 20683   914
# 4

test_bfe <- dataset[ dataset$dataType == 'test',]
train_bfe <- dataset[ dataset$dataType == 'train',]
m <- splitdf(train_bfe , seed = 20)
train_set <- m$trainset

avg_price <- train_set[, c('log_price', 'zipcode')] %>%
  group_by(zipcode) %>%
  summarise(log_price_avg = mean(log_price), house_cnt = length(log_price))
dataset <- join(dataset, avg_price, by = 'zipcode')

# 5 

dataset$log_price_avg_adj <- ((dataset$log_price_avg * dataset$house_cnt) - dataset$log_price) / (dataset$house_cnt - 1)

# 6 
dataset$log_price_avg_adj[is.na(dataset$log_price_avg_adj)] <- dataset$log_price_avg[is.na(dataset$log_price_avg_adj)]

dataset <- dataset[,-which(names(dataset) == 'log_price_avg')]

# 7 

grade_subs <- dataset[, c('grade', 'condition', 'zipcode')] %>%
  group_by(zipcode) %>%
  summarise(grade_avg = mean(grade), grade_std = sd(grade),
            condition_avg = mean(condition), condition_std = sd(condition))

subset <- dataset[,c('grade' , 'condition' , 'zipcode') ]

subset <- join( subset , grade_subs , by ='zipcode')

subset$resp_grade <- (subset$grade - subset$grade_avg) / subset$grade_std
subset$resp_cond <- (subset$condition - subset$condition_avg) / subset$condition_std
dataset$resp_grade <- subset$resp_grade
dataset$resp_cond<- subset$resp_cond
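Step 7 is a groupwise z-score. The same standardisation can be written more compactly with base R's `ave()`; a small sketch equivalent in result to the join-based code above (`zscore_by` is a name introduced here for illustration):

```r
# Groupwise z-score: (x - group mean) / group sd, computed per group (zipcode).
zscore_by <- function(x, g) {
  (x - ave(x, g, FUN = mean)) / ave(x, g, FUN = sd)
}

d <- data.frame(zipcode = c(1, 1, 1, 2, 2, 2),
                grade   = c(6, 7, 8, 9, 9, 12))
round(zscore_by(d$grade, d$zipcode), 3)   # -1 0 1 -0.577 -0.577 1.155
```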


str(dataset)
## 'data.frame':    21597 obs. of  33 variables:
##  $ id               : num  9.18e+09 4.64e+08 2.22e+09 6.16e+09 6.39e+09 ...
##  $ date             : Factor w/ 372 levels "01/02/2015","01/03/2015",..: 155 329 219 355 277 241 201 199 319 255 ...
##  $ price            : num  225000 641250 810000 330000 530000 ...
##  $ bedrooms         : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms        : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living      : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot         : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors           : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view             : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition        : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade            : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above       : int  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement    : int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built         : num  1967 1990 2006 1967 1951 ...
##  $ yr_renovated     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode          : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ lat              : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long             : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15    : int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15       : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
##  $ dataType         : chr  "train" "train" "train" "train" ...
##  $ log_price        : num  5.35 5.81 5.91 5.52 5.72 ...
##  $ log_size         : num  3.1 3.35 3.6 3.28 3.26 ...
##  $ houseview        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ outdoor          : int  6250 330 205543 5650 3186 1966 43320 9060 4590 6980 ...
##  $ yr_sold          : num  2014 2014 2014 2015 2014 ...
##  $ house_age        : num  47 24 8 48 63 6 19 31 27 55 ...
##  $ renovated        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ house_cnt        : int  141 308 53 233 309 309 145 265 235 309 ...
##  $ log_price_avg_adj: num  5.46 5.74 5.74 5.58 5.77 ...
##  $ resp_grade       : num  -0.487 2.977 0.917 -0.201 -0.45 ...
##  $ resp_cond        : num  -0.583 -0.76 -0.503 0.737 0.657 ...
# Factorizing some variables
columns <- c( "floors", "waterfront", "view", "condition")
dataset[columns] <- lapply(dataset[columns], factor)

# Replacing True or False by 1 and 0 
dataset <- dataset%>% mutate(houseview = ifelse(houseview == FALSE, 0, 1))
test_fe <- dataset[ dataset$dataType == 'test',]
train_fe <- dataset[ dataset$dataType == 'train',]
m <- splitdf(train_fe , seed = 20)
train_set <- m$trainset
val_set <- m$testset
train_set <- train_set[,!(names(train_set) %in% c("dataType" , 'price' , 'date' , 'id' ))]
val_set <- val_set[,!(names(val_set) %in% c("dataType" , 'price' , 'date', 'id' ))]
names(val_set)
##  [1] "bedrooms"          "bathrooms"         "sqft_living"      
##  [4] "sqft_lot"          "floors"            "waterfront"       
##  [7] "view"              "condition"         "grade"            
## [10] "sqft_above"        "sqft_basement"     "yr_built"         
## [13] "yr_renovated"      "zipcode"           "lat"              
## [16] "long"              "sqft_living15"     "sqft_lot15"       
## [19] "log_price"         "log_size"          "houseview"        
## [22] "outdoor"           "yr_sold"           "house_age"        
## [25] "renovated"         "house_cnt"         "log_price_avg_adj"
## [28] "resp_grade"        "resp_cond"
names(train_set)
##  [1] "bedrooms"          "bathrooms"         "sqft_living"      
##  [4] "sqft_lot"          "floors"            "waterfront"       
##  [7] "view"              "condition"         "grade"            
## [10] "sqft_above"        "sqft_basement"     "yr_built"         
## [13] "yr_renovated"      "zipcode"           "lat"              
## [16] "long"              "sqft_living15"     "sqft_lot15"       
## [19] "log_price"         "log_size"          "houseview"        
## [22] "outdoor"           "yr_sold"           "house_age"        
## [25] "renovated"         "house_cnt"         "log_price_avg_adj"
## [28] "resp_grade"        "resp_cond"

Building the baseline linear model

str(train_set)
## 'data.frame':    11518 obs. of  29 variables:
##  $ bedrooms         : int  3 4 3 4 4 2 4 3 3 3 ...
##  $ bathrooms        : num  2.75 2.5 2.5 3.25 2.75 1 2.75 1.5 1.5 2 ...
##  $ sqft_living      : int  3010 2200 1040 3420 2260 810 3410 1460 1260 1390 ...
##  $ sqft_lot         : int  12432 9099 1032 5012 12005 4368 95396 78408 7964 6005 ...
##  $ floors           : Factor w/ 6 levels "1","1.5","2",..: 1 3 5 3 1 1 2 1 1 3 ...
##  $ waterfront       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ view             : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ condition        : Factor w/ 5 levels "1","2","3","4",..: 4 3 3 3 4 3 4 4 4 3 ...
##  $ grade            : int  8 8 7 10 8 6 10 7 7 8 ...
##  $ sqft_above       : int  1890 2200 1040 2330 2260 810 3410 1460 1260 1390 ...
##  $ sqft_basement    : int  1120 0 0 1090 0 0 0 0 0 0 ...
##  $ yr_built         : num  1970 1994 2007 2008 1956 ...
##  $ yr_renovated     : int  0 0 0 0 1989 0 0 0 0 0 ...
##  $ zipcode          : int  98052 98034 98107 98019 98005 98115 98005 98075 98148 98022 ...
##  $ lat              : num  47.6 47.7 47.7 47.7 47.6 ...
##  $ long             : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15    : int  2500 2270 1290 2320 1870 1890 3721 3320 1510 1264 ...
##  $ sqft_lot15       : int  12432 8900 1275 5465 10800 5253 35352 7787 8776 5550 ...
##  $ log_price        : num  5.87 5.77 5.64 5.74 5.87 ...
##  $ log_size         : num  3.48 3.34 3.02 3.53 3.35 ...
##  $ houseview        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ outdoor          : int  9422 6899 -8 1592 9745 3558 91986 76948 6704 4615 ...
##  $ yr_sold          : num  2014 2015 2015 2014 2014 ...
##  $ house_age        : num  44 21 8 6 58 57 52 52 59 9 ...
##  $ renovated        : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ house_cnt        : int  313 284 131 90 94 309 94 195 35 119 ...
##  $ log_price_avg_adj: num  5.79 5.68 5.74 5.61 5.89 ...
##  $ resp_grade       : num  -0.189 0.504 -0.552 2.629 -0.407 ...
##  $ resp_cond        : num  1.277 -0.629 -0.609 -0.221 0.451 ...
baseline <- lm(formula = log_price ~ . , data = train_set )

summary(baseline)
## 
## Call:
## lm(formula = log_price ~ ., data = train_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55085 -0.04535  0.00238  0.04744  0.50907 
## 
## Coefficients: (4 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6.447e+01  3.708e+00 -17.387  < 2e-16 ***
## bedrooms          -6.417e-03  1.062e-03  -6.043 1.56e-09 ***
## bathrooms          1.785e-02  1.832e-03   9.745  < 2e-16 ***
## sqft_living        1.356e-05  3.670e-06   3.694 0.000222 ***
## sqft_lot           2.153e-07  2.722e-08   7.910 2.80e-15 ***
## floors1.5          1.318e-02  3.058e-03   4.312 1.63e-05 ***
## floors2            1.868e-03  2.455e-03   0.761 0.446686    
## floors2.5          2.344e-02  8.857e-03   2.646 0.008151 ** 
## floors3           -1.671e-02  5.502e-03  -3.037 0.002397 ** 
## floors3.5          1.567e-02  8.135e-02   0.193 0.847275    
## waterfront1        1.855e-01  1.058e-02  17.534  < 2e-16 ***
## view1              4.263e-02  6.247e-03   6.824 9.28e-12 ***
## view2              4.480e-02  3.848e-03  11.642  < 2e-16 ***
## view3              6.830e-02  5.207e-03  13.118  < 2e-16 ***
## view4              1.116e-01  7.905e-03  14.113  < 2e-16 ***
## condition2        -1.150e-02  2.445e-02  -0.470 0.638237    
## condition3         4.309e-02  2.400e-02   1.795 0.072626 .  
## condition4         4.990e-02  2.576e-02   1.937 0.052796 .  
## condition5         7.439e-02  2.790e-02   2.666 0.007678 ** 
## grade             -1.599e-02  2.785e-03  -5.742 9.57e-09 ***
## sqft_above         2.781e-05  2.514e-06  11.063  < 2e-16 ***
## sqft_basement             NA         NA      NA       NA    
## yr_built          -6.085e-04  4.518e-05 -13.469  < 2e-16 ***
## yr_renovated       1.526e-03  2.410e-04   6.332 2.51e-10 ***
## zipcode            8.777e-06  1.923e-05   0.456 0.648157    
## lat                1.282e-01  8.188e-03  15.664  < 2e-16 ***
## long              -1.078e-01  8.028e-03 -13.434  < 2e-16 ***
## sqft_living15      2.449e-05  2.002e-06  12.233  < 2e-16 ***
## sqft_lot15         1.036e-07  4.284e-08   2.418 0.015619 *  
## log_size           2.554e-01  1.534e-02  16.651  < 2e-16 ***
## houseview                 NA         NA      NA       NA    
## outdoor                   NA         NA      NA       NA    
## yr_sold            2.277e-02  1.618e-03  14.073  < 2e-16 ***
## house_age                 NA         NA      NA       NA    
## renovated         -3.023e+00  4.810e-01  -6.285 3.39e-10 ***
## house_cnt          5.986e-05  1.047e-05   5.716 1.12e-08 ***
## log_price_avg_adj  7.755e-01  1.024e-02  75.762  < 2e-16 ***
## resp_grade         6.228e-02  2.633e-03  23.654  < 2e-16 ***
## resp_cond          4.242e-03  2.752e-03   1.541 0.123298    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08074 on 11483 degrees of freedom
## Multiple R-squared:  0.8756, Adjusted R-squared:  0.8753 
## F-statistic:  2378 on 34 and 11483 DF,  p-value: < 2.2e-16
predicted_lm <- as.data.frame(predict(baseline , val_set))
## Warning in predict.lm(baseline, val_set): prediction from a rank-deficient
## fit may be misleading
MAPE_lm <- mean(abs((10^(val_set$log_price)-10^(predicted_lm$`predict(baseline, val_set)`))/10^(val_set$log_price)))
RMSE_lm <- sqrt(mean((10^(val_set$log_price)-10^(predicted_lm$`predict(baseline, val_set)`))^2))
print(paste0( 'The MAPE of the Linear Model is : ' , MAPE_lm , " and the RMSE is : " , RMSE_lm))
## [1] "The MAPE of the Linear Model is : 0.144899053375087 and the RMSE is : 139574.135054944"
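The same MAPE and RMSE expressions are repeated inline for each model below. As a readability sketch (the helper names `mape` and `rmse` are my own, not part of the original code), they could be factored out once; since prices were modelled on the log10 scale, both helpers back-transform with `10^x` before computing the error:

```r
# Hypothetical helpers for the error metrics used throughout this report.
# Inputs are log10-prices; errors are computed on the original dollar scale.
mape <- function(actual_log, pred_log) {
  actual <- 10^actual_log
  pred   <- 10^pred_log
  mean(abs((actual - pred) / actual))
}

rmse <- function(actual_log, pred_log) {
  actual <- 10^actual_log
  pred   <- 10^pred_log
  sqrt(mean((actual - pred)^2))
}

# Sanity checks: perfect predictions give zero error,
# and a 10% underprediction of a single price gives MAPE 0.1
stopifnot(mape(c(5, 6), c(5, 6)) == 0,
          rmse(c(5, 6), c(5, 6)) == 0,
          abs(mape(log10(200), log10(180)) - 0.1) < 1e-9)
```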

Checking the most important variables with `varImp()`:

imp <- varImp(baseline)
imp
##                      Overall
## bedrooms           6.0431450
## bathrooms          9.7453620
## sqft_living        3.6937942
## sqft_lot           7.9103114
## floors1.5          4.3118611
## floors2            0.7609785
## floors2.5          2.6461973
## floors3            3.0367595
## floors3.5          0.1926008
## waterfront1       17.5339227
## view1              6.8243793
## view2             11.6421003
## view3             13.1181148
## view4             14.1131877
## condition2         0.4701783
## condition3         1.7953392
## condition4         1.9367845
## condition5         2.6663636
## grade              5.7423478
## sqft_above        11.0628665
## yr_built          13.4691122
## yr_renovated       6.3321317
## zipcode            0.4563364
## lat               15.6635334
## long              13.4340763
## sqft_living15     12.2330211
## sqft_lot15         2.4180500
## log_size          16.6510156
## yr_sold           14.0730860
## renovated          6.2852632
## house_cnt          5.7160489
## log_price_avg_adj 75.7624890
## resp_grade        23.6540469
## resp_cond          1.5411911
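The `NA` coefficients above (`sqft_basement`, `houseview`, `outdoor`, `house_age`) and the rank-deficient-fit warning both point to the same cause: those columns are likely exact linear combinations of other predictors (for instance, in this dataset `sqft_living` appears to equal `sqft_above + sqft_basement`). A minimal toy illustration of how `lm()` handles such aliasing:

```r
# Toy example of rank deficiency: s is an exact linear combination of a and b,
# so lm() pivots it out and reports its coefficient as NA.
set.seed(42)
df <- data.frame(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3))
df$s <- df$a + df$b                          # perfectly collinear column
df$y <- 3 * df$a + 2 * df$b + rnorm(4, sd = 0.01)

fit <- lm(y ~ a + b + s, data = df)
is.na(coef(fit)["s"])                        # TRUE: s is aliased
```

`alias(baseline)` would list the exact dependencies in the real model; dropping the aliased columns before fitting removes the warning without changing the fitted values.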

Random Forest Model:

str(train_set)
## 'data.frame':    11518 obs. of  29 variables:
##  $ bedrooms         : int  3 4 3 4 4 2 4 3 3 3 ...
##  $ bathrooms        : num  2.75 2.5 2.5 3.25 2.75 1 2.75 1.5 1.5 2 ...
##  $ sqft_living      : int  3010 2200 1040 3420 2260 810 3410 1460 1260 1390 ...
##  $ sqft_lot         : int  12432 9099 1032 5012 12005 4368 95396 78408 7964 6005 ...
##  $ floors           : Factor w/ 6 levels "1","1.5","2",..: 1 3 5 3 1 1 2 1 1 3 ...
##  $ waterfront       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ view             : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ condition        : Factor w/ 5 levels "1","2","3","4",..: 4 3 3 3 4 3 4 4 4 3 ...
##  $ grade            : int  8 8 7 10 8 6 10 7 7 8 ...
##  $ sqft_above       : int  1890 2200 1040 2330 2260 810 3410 1460 1260 1390 ...
##  $ sqft_basement    : int  1120 0 0 1090 0 0 0 0 0 0 ...
##  $ yr_built         : num  1970 1994 2007 2008 1956 ...
##  $ yr_renovated     : int  0 0 0 0 1989 0 0 0 0 0 ...
##  $ zipcode          : int  98052 98034 98107 98019 98005 98115 98005 98075 98148 98022 ...
##  $ lat              : num  47.6 47.7 47.7 47.7 47.6 ...
##  $ long             : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15    : int  2500 2270 1290 2320 1870 1890 3721 3320 1510 1264 ...
##  $ sqft_lot15       : int  12432 8900 1275 5465 10800 5253 35352 7787 8776 5550 ...
##  $ log_price        : num  5.87 5.77 5.64 5.74 5.87 ...
##  $ log_size         : num  3.48 3.34 3.02 3.53 3.35 ...
##  $ houseview        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ outdoor          : int  9422 6899 -8 1592 9745 3558 91986 76948 6704 4615 ...
##  $ yr_sold          : num  2014 2015 2015 2014 2014 ...
##  $ house_age        : num  44 21 8 6 58 57 52 52 59 9 ...
##  $ renovated        : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ house_cnt        : int  313 284 131 90 94 309 94 195 35 119 ...
##  $ log_price_avg_adj: num  5.79 5.68 5.74 5.61 5.89 ...
##  $ resp_grade       : num  -0.189 0.504 -0.552 2.629 -0.407 ...
##  $ resp_cond        : num  1.277 -0.629 -0.609 -0.221 0.451 ...
### Random Forest model: compared with the linear baseline above, both the MAPE and the RMSE are reduced

rf <- randomForest(formula = log_price ~ . , data = train_set)

summary(rf)
##                 Length Class  Mode     
## call                3  -none- call     
## type                1  -none- character
## predicted       11518  -none- numeric  
## mse               500  -none- numeric  
## rsq               500  -none- numeric  
## oob.times       11518  -none- numeric  
## importance         28  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             11  -none- list     
## coefs               0  -none- NULL     
## y               11518  -none- numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
predicted_rf <- as.data.frame(predict(rf , val_set))
MAPE_rf <- mean(abs((10^(val_set$log_price)-10^(predicted_rf$`predict(rf, val_set)`))/10^(val_set$log_price)))
RMSE_rf <- sqrt(mean((10^(val_set$log_price)-10^(predicted_rf$`predict(rf, val_set)`))^2))
print(paste0( 'The MAPE of the Random Forest Model is : ' , MAPE_rf , " and the RMSE is : " , RMSE_rf))
## [1] "The MAPE of the Random Forest Model is : 0.118297697513655 and the RMSE is : 132394.388623435"
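`randomForest()` was run with its defaults here. For regression, the default `mtry` (the number of variables tried at each split) is `max(floor(p / 3), 1)`; a quick back-of-the-envelope check for the 28 predictors in `train_set` (29 columns minus the `log_price` response):

```r
# train_set has 29 columns, one of which (log_price) is the response.
p <- 29 - 1                      # 28 candidate predictors
mtry_default <- max(floor(p / 3), 1)
mtry_default                     # 9 variables tried at each split
```

`tuneRF()`, or passing `mtry` explicitly, could be used to tune this further, though even the defaults already beat the linear baseline.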

Next, XGBoost is tried. The parameters below were tuned over multiple trials, after studying what each one controls and adapting it to this data: `eta` is the learning rate, `max_depth` and `min_child_weight` limit the complexity of each tree, and `num_parallel_tree = 100` grows several trees per boosting round.

dtrain <- xgb.DMatrix(data.matrix(train_set[,-which(names(train_set) == "log_price")], rownames.force = NA), label = train_set$log_price)

dtest <- xgb.DMatrix(data.matrix(val_set[,-which(names(val_set) == "log_price")], rownames.force = NA), label = val_set$log_price)

watchlist <- list(train = dtrain, eval = dtest)

param <- list(max_depth = 6 , booster = 'gbtree',
              eta = 0.2, 
              silent = 1, 
              nthread = 2, 
              min_child_weight = 4 , 
              num_parallel_tree = 100 , 
              objective = "reg:linear", 
              eval_metric = "rmse")

xgboost <- xgb.train(param, dtrain, nrounds = 100, watchlist )
## [1]  train-rmse:4.136961 eval-rmse:4.141459 
## [2]  train-rmse:3.310920 eval-rmse:3.314828 
## [3]  train-rmse:2.650132 eval-rmse:2.653746 
## [4]  train-rmse:2.121501 eval-rmse:2.124947 
## [5]  train-rmse:1.698711 eval-rmse:1.701570 
## [6]  train-rmse:1.360606 eval-rmse:1.363010 
## [7]  train-rmse:1.090259 eval-rmse:1.092579 
## [8]  train-rmse:0.874113 eval-rmse:0.876011 
## [9]  train-rmse:0.701485 eval-rmse:0.703176 
## [10] train-rmse:0.563639 eval-rmse:0.565360 
## [11] train-rmse:0.453658 eval-rmse:0.455521 
## [12] train-rmse:0.366084 eval-rmse:0.368207 
## [13] train-rmse:0.296576 eval-rmse:0.299021 
## [14] train-rmse:0.241567 eval-rmse:0.244585 
## [15] train-rmse:0.198275 eval-rmse:0.201813 
## [16] train-rmse:0.164480 eval-rmse:0.168785 
## [17] train-rmse:0.138345 eval-rmse:0.143457 
## [18] train-rmse:0.118384 eval-rmse:0.124360 
## [19] train-rmse:0.103550 eval-rmse:0.110331 
## [20] train-rmse:0.092256 eval-rmse:0.099899 
## [21] train-rmse:0.084403 eval-rmse:0.092805 
## [22] train-rmse:0.078030 eval-rmse:0.087198 
## [23] train-rmse:0.074071 eval-rmse:0.083782 
## [24] train-rmse:0.070971 eval-rmse:0.081310 
## [25] train-rmse:0.068614 eval-rmse:0.079446 
## [26] train-rmse:0.067092 eval-rmse:0.078301 
## [27] train-rmse:0.065784 eval-rmse:0.077405 
## [28] train-rmse:0.064433 eval-rmse:0.076326 
## [29] train-rmse:0.063803 eval-rmse:0.075898 
## [30] train-rmse:0.063133 eval-rmse:0.075505 
## [31] train-rmse:0.062620 eval-rmse:0.075236 
## [32] train-rmse:0.061819 eval-rmse:0.074651 
## [33] train-rmse:0.061481 eval-rmse:0.074538 
## [34] train-rmse:0.060959 eval-rmse:0.074273 
## [35] train-rmse:0.060734 eval-rmse:0.074169 
## [36] train-rmse:0.059760 eval-rmse:0.073415 
## [37] train-rmse:0.059366 eval-rmse:0.073315 
## [38] train-rmse:0.059155 eval-rmse:0.073253 
## [39] train-rmse:0.058887 eval-rmse:0.073115 
## [40] train-rmse:0.058173 eval-rmse:0.072784 
## [41] train-rmse:0.057851 eval-rmse:0.072618 
## [42] train-rmse:0.057615 eval-rmse:0.072556 
## [43] train-rmse:0.057398 eval-rmse:0.072474 
## [44] train-rmse:0.057129 eval-rmse:0.072442 
## [45] train-rmse:0.057014 eval-rmse:0.072442 
## [46] train-rmse:0.056731 eval-rmse:0.072303 
## [47] train-rmse:0.056322 eval-rmse:0.072050 
## [48] train-rmse:0.055873 eval-rmse:0.071854 
## [49] train-rmse:0.055477 eval-rmse:0.071728 
## [50] train-rmse:0.055190 eval-rmse:0.071646 
## [51] train-rmse:0.055054 eval-rmse:0.071584 
## [52] train-rmse:0.054635 eval-rmse:0.071274 
## [53] train-rmse:0.054348 eval-rmse:0.071181 
## [54] train-rmse:0.054137 eval-rmse:0.071179 
## [55] train-rmse:0.053497 eval-rmse:0.070786 
## [56] train-rmse:0.053112 eval-rmse:0.070724 
## [57] train-rmse:0.052962 eval-rmse:0.070671 
## [58] train-rmse:0.052699 eval-rmse:0.070582 
## [59] train-rmse:0.052418 eval-rmse:0.070344 
## [60] train-rmse:0.052259 eval-rmse:0.070340 
## [61] train-rmse:0.051795 eval-rmse:0.070268 
## [62] train-rmse:0.051477 eval-rmse:0.070119 
## [63] train-rmse:0.051350 eval-rmse:0.070167 
## [64] train-rmse:0.051173 eval-rmse:0.070115 
## [65] train-rmse:0.051117 eval-rmse:0.070117 
## [66] train-rmse:0.050882 eval-rmse:0.070076 
## [67] train-rmse:0.050717 eval-rmse:0.069973 
## [68] train-rmse:0.050455 eval-rmse:0.069940 
## [69] train-rmse:0.050274 eval-rmse:0.069911 
## [70] train-rmse:0.050067 eval-rmse:0.069759 
## [71] train-rmse:0.049741 eval-rmse:0.069618 
## [72] train-rmse:0.049315 eval-rmse:0.069365 
## [73] train-rmse:0.049088 eval-rmse:0.069286 
## [74] train-rmse:0.048923 eval-rmse:0.069260 
## [75] train-rmse:0.048852 eval-rmse:0.069260 
## [76] train-rmse:0.048803 eval-rmse:0.069239 
## [77] train-rmse:0.048383 eval-rmse:0.069057 
## [78] train-rmse:0.048075 eval-rmse:0.068931 
## [79] train-rmse:0.047865 eval-rmse:0.068855 
## [80] train-rmse:0.047709 eval-rmse:0.068757 
## [81] train-rmse:0.047465 eval-rmse:0.068681 
## [82] train-rmse:0.047198 eval-rmse:0.068633 
## [83] train-rmse:0.046803 eval-rmse:0.068521 
## [84] train-rmse:0.046495 eval-rmse:0.068458 
## [85] train-rmse:0.046212 eval-rmse:0.068265 
## [86] train-rmse:0.045983 eval-rmse:0.068234 
## [87] train-rmse:0.045758 eval-rmse:0.068179 
## [88] train-rmse:0.045549 eval-rmse:0.068044 
## [89] train-rmse:0.045388 eval-rmse:0.068058 
## [90] train-rmse:0.045154 eval-rmse:0.067950 
## [91] train-rmse:0.045007 eval-rmse:0.067907 
## [92] train-rmse:0.044663 eval-rmse:0.067740 
## [93] train-rmse:0.044417 eval-rmse:0.067589 
## [94] train-rmse:0.044276 eval-rmse:0.067532 
## [95] train-rmse:0.044070 eval-rmse:0.067444 
## [96] train-rmse:0.043911 eval-rmse:0.067413 
## [97] train-rmse:0.043784 eval-rmse:0.067390 
## [98] train-rmse:0.043656 eval-rmse:0.067387 
## [99] train-rmse:0.043550 eval-rmse:0.067287 
## [100]    train-rmse:0.043468 eval-rmse:0.067261
# summary(xgboost)

predicted_xgboost <- as.data.frame(predict(xgboost , dtest))
MAPE_xg <- mean(abs((10^(val_set$log_price)-10^(predicted_xgboost$`predict(xgboost, dtest)`))/10^(val_set$log_price)))
RMSE_xg <- sqrt(mean((10^(val_set$log_price)-10^(predicted_xgboost$`predict(xgboost, dtest)`))^2))

print(paste0( 'The MAPE of the XGBoost Model is : ' , MAPE_xg , " and the RMSE is : " , RMSE_xg))
## [1] "The MAPE of the XGBoost Model is : 0.110969885696949 and the RMSE is : 118411.115362964"
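Collecting the validation scores reported above makes the comparison explicit (the values are copied from the printouts, rounded):

```r
# Validation scores of the three models, as reported earlier in this document
scores <- data.frame(
  model = c("Linear Model", "Random Forest", "XGBoost"),
  MAPE  = c(0.1449, 0.1183, 0.1110),
  RMSE  = c(139574, 132394, 118411)
)

scores[which.min(scores$MAPE), "model"]  # "XGBoost" wins on both metrics
```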

Since XGBoost gave the best validation scores (lowest MAPE and RMSE), it is the model used for the predictions on the final test set.

d_f_test<- xgb.DMatrix(data.matrix(test_fe[,!(names(test_fe) %in% c("dataType" , 'price' , 'date' , 'id' ,'log_price'))], rownames.force = NA), label = test_fe$log_price)

test_fe$price_pred <- 10^predict(xgboost , d_f_test)

id_pred <- test_fe[,c('id', 'price_pred')]
write.csv(id_pred, 'test_prediction.csv', row.names = FALSE)  # row.names = FALSE avoids an extra index column